Method for carrying out an audio conference, audio conference device, and method for switching between encoders

ABSTRACT

A method and an audio conference device for carrying out an audio conference are disclosed, whereby classification information associated with a respective audio data flow is recorded for supplied audio data flows. According to a result of an evaluation of the classification information, the audio data flows are associated with at least three groups which are homogeneous with regard to the results. The individual audio data flows are processed uniformly in each group in terms of the signals thereof, and said audio data flows processed in this way are superimposed in order to form audio conference data flows to be transmitted to the communication terminals.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is the US National Stage of International Application No. PCT/EP2006/007495 filed Jul. 28, 2006, claims the benefit thereof and is incorporated by reference herein in its entirety.

FIELD OF INVENTION

The invention relates to a method for carrying out an audio conference, an audio conference device and a method for switching between encoders.

BACKGROUND OF INVENTION

Speech conference systems allow a number of speech terminals to be connected together into a telephone conference, so that the audio signals picked up via the respective microphones of the speech terminals of the other participants are fed to each participant as a mixed conference signal for audio output. The mixed conference signal intended for output to a participant, also referred to below as the mixed signal, is in such cases predominantly a superimposition of all audio signals present, however frequently without the audio signal of that participant, since the participant does not need to hear his or her own contributions to the conference and in fact should not usually do so, since this would cause a type of undesired echo effect which the participant could find disturbing. Thus a specific mixed signal is frequently formed for each of the N participants of a telephone conference, in which the (N-1) voice signals of the other participants of the telephone conference are processed into the specific mixed signal. This can prove expensive in terms of computing power for the audio conferencing system and can entail difficulties in understanding speech for the participants involved in the telephone conference, since the respective mixed signal can, for example, also include audio signals with background noises, with the background noises of a number of audio signals being able to be superimposed so that they are clearly perceptible and adversely affect the comprehensibility of the useful audio signals, i.e. the sentences spoken by one of the participants.

To reduce the computing outlay and the background noises it can be useful, especially for telephone conferences with a comparatively large number of participants, not to superimpose all (N-1) speech signals of the N participants, but merely those of a subset of these N participants, in particular a subset of M (with M<N) actively speaking participants. The audio signals of the other, largely inactive, participants can be ignored in the creation of the mixed signal, so that only the audio signals of the M actively speaking participants are superimposed. This method of operation is based on the assumption that in a well-organized teleconference led by a moderator only a few participants are speaking at the same time and that participants usually speak one after another.

A method of this type for a packet-switched communication system, in which an audio energy is determined for each conference participant, on the basis of which a number M of conference participants are included in a mixed signal and the remaining conference participants are not included in the mixed signal, is known from the publication “Automatic Addition and Deletion of clients in VoIP Conferencing”, IEEE Symposium on Computers and Communications, Hammamet, Tunisia, July 2001 by Prasad, Kuri, Jamadagni, Dagale, Ravindranath. The particular characteristic of the method is that for each conference participant the mixed signal is formed individually at a terminal of the respective conference participant and each conference participant can adapt the volumes of the M mixed signals themselves via a user interface. However this demands a high transmission bandwidth. Furthermore the publication mentions an upper limit of M=4.

If, as with the method mentioned in the last section, the set of active and inactive participants is formed dynamically and adapted over the course of time to the current and changing activity circumstances in accordance with the audio signals present in the audio conference system, this results in disadvantages in the audio quality of the mixed signal on removal of a previously active and now inactive audio signal from the mixed signal or on insertion of a previously inactive and now active audio signal into the mixed signal. For example an abrupt appearance and/or an abrupt disappearance of background noises can occur where an audio signal of a participant features such background noises and this audio signal is determined for one period as belonging to an active participant and for another period as belonging to an inactive participant. In addition a crosstalk effect and a truncation of crosstalk audio signals can occur in the form of so-called speech clipping, which can be produced as a result of an incorrect composition of the audio signals viewed as active.

SUMMARY OF INVENTION

In addition the speech quality can suffer if channels of an audio conferencing system are dynamically created and the different input signals are dynamically mixed and connected to dynamically changing destination participants, so that a state-dependent method for encoding the mixed signals to be transmitted can lead to encoding errors, for example during switchover from an original encoder to a further encoder, or to decoding errors during decoding at a participant device. For example this can occur if a previously inactive participant becomes an active participant and a new individual encoder and conference channel is instantiated for this participant and an individual mixed signal is formed for this participant by means of the individual encoder. The result for the participant is thus that the received encoded mixed signal is formed after a point in time by another encoder based on another composition of the mixed signal. A decoder of a receiving participant terminal will thus receive the encoded mixed signal of the original encoder up to a particular point in time and subsequently the encoded mixed signal of the further encoder. For an interim period this can result in adverse effects on quality in audio output at the participant terminal.

An object of the invention is to specify an improved method and an improved arrangement to make it possible to carry out audio conferences in an optimized manner.

This object is achieved by a method for carrying out an audio conference and an audio conferencing device as well as by a switchover method for switchover between encoders and a further audio conferencing device according to the independent claims.

Advantageous developments and embodiments of the invention are specified in the dependent claims.

To carry out an audio conference in which the audio conference is fed audio data flows from communications devices and for the audio data flows classification information assigned to each audio data flow is detected, the audio data flows are assigned in accordance with a result of an evaluation of the classification information to at least three groups which are homogeneous with regard to the results. The individual audio data flows are processed uniformly in each group in terms of the signals thereof, and the signal-processed audio data flows are superimposed to form audio conference data flows to be output to the communication terminals.

Audio data flows in this case are especially encoded audio signals for circuit-switched or packet-switched transmission, with the audio signals preferably representing speech signals picked up at the communication devices by means of microphones. The communication devices can be voice terminals or video or multimedia terminals, with an audio portion of the video or multimedia data being regarded as the audio component.

The signal processing and/or the superimposing of the audio data flows can be undertaken directly for the audio data flows present in encoded form or for audio data flows decoded into audio signals. When audio signals are used, an audio data flow is decoded by means of a decoder of a CODEC (encoding and decoding). After the signal processing and/or the superimposition of audio signals decoded in this way, the audio signals will be converted by means of encoding by a further or by the same CODEC into audio data flows for transmission to the communication terminals.

An evaluation of the classification information is especially to be understood as a comparison with reference values, with one result of the evaluation for example being the information that the classification is below a first reference value but above a second reference value. Furthermore during the evaluation of the classification information the classification information of the audio data flows can be considered separately or can be interrelated. In addition an evaluation of the classification information can be based on different logically connected checking steps. Furthermore the checking steps can be differently weighted so that for example an arrangement into a specific group can be forced by a specific checking criterion of a checking step not being fulfilled.

The assignment into three groups that are homogeneous in relation to the result is undertaken, for example, with two reference values such that all audio data flows of which the classification information lies below the first and the second reference value are assigned to a first group, all audio data flows of which the classification information lies above the first and the second reference value are assigned to a second group, and all audio data flows of which the classification information lies between the first and the second reference value are assigned to a third group.
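The assignment with two reference values can be illustrated by the following minimal sketch. The function name `assign_groups`, the parameter names `low_ref` and `high_ref`, and the numeric values are assumptions chosen for illustration and are not taken from the description.

```python
from typing import Dict, List


def assign_groups(levels: Dict[str, float], low_ref: float, high_ref: float) -> Dict[str, List[str]]:
    """Sort audio data flows into three groups using two reference values."""
    groups: Dict[str, List[str]] = {"first": [], "second": [], "third": []}
    for flow, level in levels.items():
        if level < low_ref:
            groups["first"].append(flow)    # below both reference values
        elif level > high_ref:
            groups["second"].append(flow)   # above both reference values
        else:
            groups["third"].append(flow)    # between the reference values
    return groups


# Hypothetical classification values for four audio data flows.
example = assign_groups(
    {"ADS1": 0.9, "ADS2": 0.3, "ADS3": 0.05, "ADS4": 0.0},
    low_ref=0.1,
    high_ref=0.5,
)
```

Every flow in a group has produced the same comparison result, which is what makes the groups homogeneous in the sense used here.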

The groups are homogeneous or conformant such that all audio data flows which are assigned to a group deliver the same comparison or analysis results.

The method is advantageous insofar as the complexity of the signal processing can be reduced, since only a few different signal processing operations are needed, in accordance with the number of groups. In addition the comprehensibility of the speech on receipt of the superimposed data flow can be improved, since audio data flows regarded as important can be accentuated in the superimposition, for example by changing a volume, a tone level, a phase position or other audio parameters of the respective audio data flows, and conversely audio data flows regarded as unimportant can be attenuated or processed in another way.

Preferably, in an advantageous embodiment of the invention, within the framework of the detection of one of the items of classification information for one of the audio data flows, at least one variable reflecting a characteristic of the audio data flow can be detected. This variable represents, for example, an audio level, a volume level or an audio energy, and can be determined by means of measurement and/or signal analysis. The variables determined can preferably be compared to reference values by means of simple-to-implement comparison operators, so that the audio data flows can be divided up for example into loud, quiet and completely muted audio data flows.
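One possible way to obtain such a variable from a decoded frame is a root-mean-square level, sketched below. The description does not prescribe a particular measure, so the choice of RMS and the name `rms_level` are assumptions.

```python
import math
from typing import Sequence


def rms_level(samples: Sequence[float]) -> float:
    """Root-mean-square level of one frame of decoded audio samples."""
    if not samples:
        return 0.0  # an empty frame is treated as silence
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

The resulting value can then be compared with reference values using ordinary comparison operators.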

Preferably such a determination of audio parameters can already be undertaken by a CODEC during decoding of a respective audio data flow, since decoding can be undertaken in any event before signal processing and before a superimposition of the different audio data. In this way an existing component of an audio conferencing system, the CODEC, can be used for implementing parts of the method steps of the invention.

A further variable to be used for grouping the audio data flows is for example a speech frequency value, which represents a relationship determined over a period of time between speech activity and speech inactivity. Together with the analysis of a volume level the speech frequency value can be employed for a precise distinction between speakers who are only active once or are rarely active and speakers who are active over long periods, possibly with short interruptions.
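A speech frequency value of this kind could, for instance, be computed as the share of time intervals with detected speech activity within an observation window. The exact definition is not given in the description, so the formula and the name `speech_frequency` below are assumptions.

```python
from typing import Sequence


def speech_frequency(activity_flags: Sequence[bool]) -> float:
    """Share of intervals with speech activity within the observation window."""
    if not activity_flags:
        return 0.0  # no observations yet
    active = sum(1 for flag in activity_flags if flag)
    return active / len(activity_flags)
```

A speaker with short pauses keeps a high value over the window, whereas a participant who speaks only once yields a low value.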

For such analyses it is especially advantageous for not only absolute values which reflect a characteristic of the audio data flows to be evaluated but for the relationship of the variables to each other to also be included in the result of the evaluation. Thus for example a group can be formed from the two most active audio data flows, regardless of whether all audio data flows are currently transmitting quieter or louder speech. This type of classification of the audio data flows can be determined by comparing classification information between the audio data flows, or by dynamic adaptation of the absolute reference values for the comparison with the classification information.
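A relative evaluation of this kind can be sketched by ranking the flows against each other instead of comparing them with fixed reference values. The function name `most_active` and the sample levels are hypothetical.

```python
from typing import Dict, List


def most_active(levels: Dict[str, float], count: int = 2) -> List[str]:
    """Return the `count` flows with the highest level, loudest first."""
    ranked = sorted(levels, key=lambda flow: levels[flow], reverse=True)
    return ranked[:count]


quiet = {"ADS1": 0.02, "ADS2": 0.05, "ADS3": 0.01}
loud = {flow: level * 10 for flow, level in quiet.items()}
```

Both the quiet and the loud conference yield the same two flows, because only the ordering of the values matters.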

In an advantageous embodiment of the inventive method the uniform group signal processing can comprise a group-specific attenuation or a negative amplification of the audio data flows. Thus a signal level reduction of undesired contributions in the audio conference can be achieved, for example of audio data flows which merely contain background sounds and background noise as audio signals. A grouping for a group-specific attenuation is advantageous in such cases to the extent that human hearing can only perceive significant changes in volume, and thus a complex, individually different attenuation may possibly not even be perceived. At the same time the attenuation factor is preferably to be freely configurable or dynamically changeable, to enable a flexible reaction to different influences such as the number of participants in the audio conference.

As well as an attenuation, a group-specific amplification of audio data flows is also possible in a similar manner, so that the process can be generally referred to as a signal strength correction.

Furthermore, especially with use of stereo output at the communication terminals of the audio conference, the audio data flows can be processed uniformly per group in relation to a phase position of their decoded audio signal, so that audio signals of audio data flows considered important appear in the middle of a virtually perceivable stereo distribution, whereas audio signals of audio data flows considered unimportant are processed in their phase position as binaural signals such that the listening conference participant perceives them as if they were arranged at the left or right edge of the virtually perceivable stereo plane.

In an advantageous development of the invention the evaluation of the classification information can comprise an assessment of a predeterminable group preselection individually assigned to one of the audio data flows so that, depending on the group preselection, the assignment to a preselected group can be forced although, for example, sole evaluation of an audio activity comparison would produce membership of another group. For example it frequently transpires in real audio conferences that the participant who initiates the audio conference also assumes the function of a moderator during the course of the audio conference. It might thus be sensible to sort this participant into a preferred group, regardless of whether he or she has made a speech contribution in a time segment of the audio conference or not, since it can frequently be important for the moderator of the audio conference, even if he or she just speaks quietly, to be clearly understood by all participants of the audio conference.

The preselection of the group can for example be undertaken using a graphical user interface for control of the audio conference, via which the participants of the conference are assigned different roles. Alternatively the roles can also be assigned by entries via the speech terminals used in each case. A role is for example “muted”, which classifies a participant whose microphone is switched off and who is merely listening. Another conceivable role is “exclusive”, in which only the audio data flows of the speaker identified as “exclusive” are included in the superimposed mixed signal of the audio conference and the audio data flows of the other participants are completely suppressed by means of attenuation.

Furthermore the audio data flows can be assigned priorities, with the priorities likewise being evaluated for division into groups. Viewed in general terms a number of criteria can be checked during division into groups, with the evaluation results being logically linked to each other. For example the audio activity can be observed together with a group preselection for individual participants, while also taking into account allocated priorities.

Individual items of the classification information can preferably be determined directly from an audio data flow or its decoded audio signal. Other classification information might be able to be determined by interrogating a configuration, with a configuration being able to be performed statically or dynamically, for example via a browser application controlling the audio conference.

Preferably the detection and the evaluation of the classification information and the division into groups can be undertaken for time intervals of the audio conference, so that the audio data flows can be assigned to different groups over time while the audio conference is being conducted. This enables the division into groups to be adapted to the audio data flows, e.g. in accordance with a currently occurring speech activity, so that a division into active speakers and inactive listeners largely corresponds to the actual circumstances at the relevant point in time. In a transmission in voice data packets or in so-called frames, a time interval can for example correspond to precisely one voice data packet or one frame, or to an integer multiple thereof.

In an advantageous development of the invention the classification information for a time interval can be detected and evaluated by including the evaluation of the classification information of a previous time interval. This allows the situation to be prevented in which a previously active participant is already sorted into a new group the first time that they become inactive, although there is a high probability that only a short speech pause is involved. By including the evaluation of the classification information of a previous time interval a type of hysteresis can preferably be achieved, in which a transition from a grouping in an active group to a grouping in an inactive group is delayed, i.e. is undertaken only after one or more time intervals have elapsed for which an inactivity is detected. By means of this method of operation it can be ensured that the group membership is not changed too frequently for an audio data flow and that the number of group changes over the course of time is kept small.
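The delayed transition can be sketched as a simple hangover counter. The number of inactive intervals to wait (`hangover`) and the function name are assumptions; the description only states that one or more intervals elapse before the change.

```python
from typing import List, Sequence


def group_with_hangover(active_flags: Sequence[bool], hangover: int = 3) -> List[str]:
    """Per-interval group labels; 'active' is left only after `hangover`
    consecutive inactive intervals, while activation takes effect at once."""
    labels: List[str] = []
    state = "inactive"
    silent_intervals = 0
    for flag in active_flags:
        if flag:
            state = "active"
            silent_intervals = 0
        else:
            silent_intervals += 1
            if state == "active" and silent_intervals >= hangover:
                state = "inactive"
        labels.append(state)
    return labels


labels = group_with_hangover(
    [True, False, False, True, False, False, False, False], hangover=3
)
```

In this example the two-interval pause does not cause a group change; the flow leaves the active group only once three inactive intervals in a row have been observed.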

In a further advantageous embodiment of the invention, on detection of a changing assignment of one of the audio data flows from an original group in a first time interval to a further group in a second time interval, the audio data flow can, for a predefined time segment, be assigned neither to the original nor to the further group, but be handled as an individual audio data flow. If for example an assignment into the original group is characterized by signal processing with lower attenuation of the audio data flows and an assignment into the further group is characterized by signal processing with stronger attenuation of the audio data flows, the result achieved can be that the attenuation for the individual audio data flow changes from the first, low attenuation provided for the original group to the second, high attenuation provided for the further group continuously in accordance with a monotonic function and/or in discrete steps over the time segment. The outcome can be that no abrupt, clearly perceptible changes are made in the attenuation curve, but rather a softer transition between the two attenuations is achieved.

Thus for example for a switch of groups an audio data flow can be raised dynamically from e.g. −94 dB (dB = decibel) to −15 dB, so that no hard, perceptible transition occurs. This especially enables it to be ensured that background noises of a speaker do not appear suddenly or disappear suddenly. Preferably the amplification or attenuation values can be freely selectable, by means of configuration for example.

Preferably for a rising audio signal edge a switch from a high attenuation to a low attenuation can occur more quickly, so that no useful speech information gets lost. On the other hand it can be advantageous for a falling audio signal edge, on a switch from a low attenuation to a high attenuation, to switch the attenuation factor slowly with intermediate steps in order to ensure a soft fading out of the audio data flows.
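A stepwise, monotonic transition with a fast rise and a slow fall could look like the following sketch. The step sizes of 40 dB and 10 dB per time interval and the function name are assumptions; only the end values of −94 dB and −15 dB come from the example above.

```python
from typing import List


def attenuation_ramp(start_db: float, target_db: float,
                     rise_step_db: float = 40.0,
                     fall_step_db: float = 10.0) -> List[float]:
    """Gain values (in dB) per time interval, moving monotonically from
    start_db to target_db; rising uses large steps, falling small ones."""
    if rise_step_db <= 0 or fall_step_db <= 0:
        raise ValueError("step sizes must be positive")
    values = [start_db]
    current = start_db
    while current != target_db:
        if target_db > current:
            # Rising edge: reduce the attenuation quickly.
            current = min(current + rise_step_db, target_db)
        else:
            # Falling edge: fade out slowly in small steps.
            current = max(current - fall_step_db, target_db)
        values.append(current)
    return values


fade_in = attenuation_ramp(-94.0, -15.0)
fade_out = attenuation_ramp(-15.0, -94.0)
```

With these assumed step sizes the fade-in reaches its target after two intervals, while the fade-out takes eight.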

In the method in which audio data flows are sorted into groups, it can be advantageous also to evaluate this grouping information in order to reduce the encoding effort for audio conference data flows to be output at the communication terminals. Thus for example precisely the same superimposed audio conference data flow can be transmitted for all audio data flows of a group of muted audio data flows, since the audio conference data flow can be restricted to the superimposing of active audio data flows.

By contrast it can be sensible to transmit, for each audio data flow assigned to a group of active participants, an audio conference data flow superimposed individually for that audio data flow, in which its own speech component is filtered out. A separate CODEC would thus be necessary for each such audio data flow for creating the respective audio conference data flow, whereas in the case given above, for the transmission of a common audio conference data flow, only one CODEC need be used for a number of audio data flows.
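The saving in encoders can be illustrated with a small planning sketch. It assumes, as a simplification not stated in the description, that all non-active flows (inactive and muted) act as pure listeners and share one common mix of the active flows; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_mixes(groups: Dict[str, List[str]]) -> Dict[str, Tuple[str, ...]]:
    """Map each flow to the tuple of active flows superimposed for it."""
    active = groups["active"]
    plan: Dict[str, Tuple[str, ...]] = {}
    for flow in groups["inactive"] + groups["muted"]:
        plan[flow] = tuple(active)  # common mix for all listeners
    for flow in active:
        # Individual mix with the speaker's own contribution left out.
        plan[flow] = tuple(f for f in active if f != flow)
    return plan


def encoders_needed(plan: Dict[str, Tuple[str, ...]]) -> int:
    """One encoder per distinct mix."""
    return len(set(plan.values()))


plan = plan_mixes({"active": ["ADS1", "ADS2"], "inactive": ["ADS3"], "muted": ["ADS4"]})
```

For four participants with two active speakers, three distinct mixes and therefore three encoders suffice under these assumptions, instead of four.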

Since in the previously mentioned advantageous developments of the invention the group membership changes dynamically over time, the result is that, to save on CODECs, an audio data flow could be applied to another CODEC if its group membership changes. However such a switchover creates, at least with state-dependent CODECs, undesired and/or unnatural sound effects, which greatly reduce the quality of the audio conference data flows.

This problem is addressed by the switchover method for switching between a first encoder and a second encoder for an audio data connection which exists between the first encoder and a decoder, especially in order to carry out an audio conference with the previously mentioned features. In this method the decoder (especially of a communication terminal) is fed encoded audio data by the first encoder (especially of an audio conference device). In such cases the first encoder is characterized in that the encoded audio data is created by this encoder, using encoding parameters influenced by an audio data history, by means of encoding and/or signal processing from a first audio input signal fed to the first encoder. In addition the encoding parameters of each of the two encoders at a current time segment are formed by the audio input signal fed in the current time segment as well as by the audio input signal of at least one previous time segment. In the switchover method the audio data connection is switched over from the first encoder to the second encoder such that, within the framework of the switchover, the encoding parameters of the second encoder are matched to the encoding parameters of the first encoder, and when the match between the encoding parameters occurs the audio connection is switched over to the second encoder.

This is a way of preventing any loss of quality occurring during the switchover from the first encoder to the second encoder, since both encoders have the same encoding parameters influencing the encoding process at the switchover point. Thus the decoder receives encoded audio data through a continuous method in which no discontinuities arise in the signal waveform. Decoding parameters based on previous time segments, which are possibly likewise provided in the decoder, thus continue to be valid and can continue to be used by the decoder after the switchover of the encoders. Decoding errors because of the switchover of the encoders can thus be prevented.

The switchover method is especially of advantage for compression CODECs, since in many known compression encoding methods preceding time segments are included for achieving a high compression factor.

Achieving a match between the encoding parameters of the two encoders can alternatively be undertaken in a current time segment or in a future time segment, with the switchover process thus being able to extend over a number of time segments.

Advantageously, after an encoding parameter match and the switchover to the second encoder have been achieved, resources of the first encoder can be released, since both encoders create the same audio data. The number of encoders used simultaneously in an audio conferencing device can thus be reduced and thereby the overall processing complexity of the audio conferencing device greatly reduced.

In an advantageous embodiment of the switchover method, within the framework of switching over the audio data connection from the first encoder to the second encoder, the first audio input signal can be modified such that the second encoder in a future time segment is put into the same state as the first encoder. This is advantageously achieved by an audio input signal fed to the second encoder also being fed to the first encoder before the actual final switchover of the encoders. In this way the audio input signals of the two encoders are identical, so that after a preferably already known number of time segments have elapsed, the encoder parameters synchronize until they become identical at a particular time segment. From this point onwards the switch can be made to the second encoder and the first encoder can be deactivated and/or released.
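The convergence of two encoders that receive the same input can be illustrated with a toy model. `ToyEncoder` is a hypothetical stand-in whose state is simply the last `memory` input values; real CODECs keep more complex state, but the principle of a finite history is the same.

```python
from collections import deque
from typing import Iterable, Optional


class ToyEncoder:
    """Hypothetical stateful encoder: output depends on the last `memory` inputs."""

    def __init__(self, memory: int = 3) -> None:
        self.history = deque([0.0] * memory, maxlen=memory)

    def encode(self, sample: float) -> float:
        prediction = sum(self.history) / len(self.history)
        self.history.append(sample)
        return sample - prediction  # encode the prediction error


def segments_until_match(first: ToyEncoder, second: ToyEncoder,
                         shared_input: Iterable[float]) -> Optional[int]:
    """Feed both encoders the same input and return the number of time
    segments after which their states are identical, or None."""
    for count, sample in enumerate(shared_input, start=1):
        first.encode(sample)
        second.encode(sample)
        if first.history == second.history:
            return count
    return None


first, second = ToyEncoder(), ToyEncoder()
for a, b in zip([1.0, 2.0, 3.0], [9.0, 8.0, 7.0]):
    first.encode(a)    # the two encoders start with different histories
    second.encode(b)
matched_after = segments_until_match(first, second, [0.5, 0.6, 0.7, 0.8])
```

With a memory of three values the states coincide after exactly three shared segments, which corresponds to the "already known number of time segments" for this toy model; from then on both encoders produce identical output.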

In an alternative, advantageous embodiment of the switchover method, within the framework of the switchover of the audio data connection from the first encoder to the second encoder, a state of the second encoder is modified such that the encoder parameters of the first encoder are detected and set as encoder parameters for the second encoder. This process is preferably undertaken at the end of a time segment or between two time segments, so that the switchover to the second encoder can already be made during the following time segment.
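Copying the encoder state between two time segments can be sketched with a minimal delta encoder. The classes `DeltaEncoder` and `DeltaDecoder` and the function `switch_over` are hypothetical illustrations; the single state variable `previous` stands in for the encoder parameters.

```python
class DeltaEncoder:
    """Hypothetical stateful encoder: emits the difference to the previous sample."""

    def __init__(self) -> None:
        self.previous = 0.0

    def encode(self, sample: float) -> float:
        delta = sample - self.previous
        self.previous = sample
        return delta


class DeltaDecoder:
    """Matching decoder: accumulates the received differences."""

    def __init__(self) -> None:
        self.previous = 0.0

    def decode(self, delta: float) -> float:
        self.previous += delta
        return self.previous


def switch_over(first: DeltaEncoder, second: DeltaEncoder) -> None:
    """Read the first encoder's state and set it on the second encoder."""
    second.previous = first.previous


first, second, decoder = DeltaEncoder(), DeltaEncoder(), DeltaDecoder()
decoded = [decoder.decode(first.encode(s)) for s in (1.0, 2.0)]
switch_over(first, second)  # performed between two time segments
decoded += [decoder.decode(second.encode(s)) for s in (3.0, 4.0)]
```

Because the second encoder continues from the first encoder's state, the decoder reconstructs the signal without a discontinuity; without the state copy the first value after the switch would be decoded incorrectly in this toy model.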

So that the switchover process can actually be undertaken without any adverse effects on quality, the first encoder and the second encoder can especially use the same encoding algorithm, with possible configurations of the encoders preferably being the same. In this way, when the switchover takes place, the decoder does not learn anything about the switch between the first and the second encoder and can continue to operate unchanged with its decoding algorithm.

As regards the match between the encoder parameters it should be pointed out that an extensive match is meant in this case, in which at least those encoder parameters with the greatest influence on the audio quality are similar and/or identical. A complete match between the encoder parameters in their full scope is not absolutely necessary, provided this does not have any perceptible negative effects on the comprehensibility of the speech.

BRIEF DESCRIPTION OF THE DRAWINGS

Exemplary embodiments of the invention will be explained in greater detail on the basis of a drawing.

The figures show schematic diagrams as follows:

FIG. 1 a block diagram of an audio conference device,

FIG. 2 a block diagram of a classification device contained in the audio conference device,

FIG. 3 function curves for three audio data flows over a respective period of time relating to an audio signal of one of the audio data flows, a resulting curve of a group assignment and a curve of an amplification factor,

FIGS. 4-7 function curves for an audio data flow over a period of time of a grouping assignment and of a curve of an amplification factor, and

FIGS. 8-12 block diagrams for illustrating various method states of an implementation of the switchover method within the framework of an audio conference.

DETAILED DESCRIPTION OF INVENTION

FIG. 1 shows a schematic diagram of an audio conference device. In this device audio data flows ADS1, ADS2, ADS3, ADS4 supplied by communication terminals not shown in the diagram are fed to a decoding device DEKOD of the audio conference device. The restriction to four audio data flows in this diagram is merely by way of an example, and three dots are used as a symbol to indicate further flows omitted from the diagram. In the decoding device DEKOD the incoming audio data flows ADS1 to ADS4 are converted by means of decoders D1, D2, D3, D4 into decoded audio signals AS1, AS2, AS3, AS4. These audio signals AS1 to AS4 are fed to a conference processing device KVE in which the audio signals AS1 to AS4 are processed, so that audio conference signals to be output are created. These audio conference signals are fed to an encoding device KOD comprising a set of encoders K1, K2, K3, K4 and possibly further encoders. This encoding device KOD encodes the audio conference signals into audio conference data flows KADS1, KADS2, KADS3, KADS4, which are respectively fed to the communication devices. The conference processing device KVE especially comprises three components connected in series which process and analyze the incoming audio signals AS1 to AS4. These are especially a grouping device GE, an amplification device VE as an inventive signal processing unit and a mixing device MIX as a superimposition unit.

The grouping unit GE is provided in this case for forming homogeneous groups of audio data flows and for example assigns to the respective audio signals AS1 to AS4 grouping information GI_(AS1), GI_(AS2), GI_(AS3), GI_(AS4) describing a grouping, or impresses such grouping information GI_(AS1) to GI_(AS4) onto the respective audio signals AS1 to AS4, with the grouping information GI_(AS1) to GI_(AS4) being transferred to the amplification unit VE together with the audio signals AS1 to AS4. Furthermore the amplification unit VE is provided for signal processing of the audio data flows ADS1 to ADS4 or their associated audio signals AS1 to AS4 by application of an amplification or attenuation factor. The mixing device MIX is used in such cases for forming superimposed audio signals from the audio signals AS1 to AS4 within the framework of the audio conference. A classification device KLASS as a classification information processing unit is shown as a further component of the audio conference device in FIG. 1, to the inputs of which the audio signals AS1 to AS4 are applied. The classification device KLASS will be examined in greater detail in a later section on the basis of FIG. 2.

The classification device KLASS in this case is intended, by evaluating or analyzing the incoming audio signals AS1 to AS4, to undertake a grouping or classification of the audio signals AS1 to AS4 and thus also of the audio data flows ADS1 to ADS4 into homogeneous groups in respect of an evaluation of classification information, and to make this information available by means of grouping information GI to the grouping unit GE. In addition the classification device KLASS provides the amplification device VE with amplification factor information VI, which specifies to what extent, and especially by which factor, the respective audio signal groups are to be amplified or attenuated.

A sequence of the method for executing an audio conference will now be explained further with reference to FIG. 1. In this case N audio data flows, with only the audio data flows ADS1 to ADS4 being considered below, are fed to the audio conference device. It should be noted here that, although all audio data flows ADS1 to ADS4 transmit useful speech information, from the semantic viewpoint only a few audio data flows contain an active contribution to the audio conference. It can thus be, for example, that within the audio conference only one active speaker is present at a given time, with all other participants listening and being inactive. It should further be noted that listening participants might still make an audible contribution to the audio conference because of background noises, which will be transmitted to the audio conference by means of one or more of the audio data flows ADS1 to ADS4. In addition there can be muted audio conference participants who, possibly by means of a statically or dynamically modifiable configuration, are to be switched to completely muted although audio signals are being transferred in their respective audio data flows. Also, by active muting of a communications device by actuating a muting service feature, incoming audio data flows to the audio conference can be made to contain no speech and/or tone information.

The audio data flows ADS1 to ADS4 are now converted, time segment by time segment, by the decoding device DEKOD into the audio signals AS1 to AS4, which are provided to the conference processing device KVE as well as to the classification device KLASS. For the respective time segment, the classification device KLASS detects and/or determines classification information assigned to the respective audio signals AS1 to AS4, and thus likewise to the assigned audio data flows ADS1 to ADS4. This is in particular a signal loudness level, a maximum impulse or a signal energy of the respective audio signal AS1 to AS4. The classification device KLASS can then evaluate the recorded classification information such that groups of audio signals or audio data flows are formed on the basis of the signal loudness level. For example, a first group of active speakers can be defined which includes all conference participants who are actively speaking at the same time. In addition, a second group of rarely actively speaking participants can be formed, for whom primarily background noises are relevant in the respective time segment. Finally, a third group of muted participants can be formed who are set permanently to inactive on the basis of a configuration, which also counts as classification information. Such a classification thus yields three homogeneous groups: a first group of active participants, a second group of inactive participants and a third group of muted participants. Each group contains only those audio data flows which can be allocated to it in accordance with the detected classification information.
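The three-group classification described above can be sketched as follows. This is a minimal illustration only, assuming that a signal energy per time segment serves as the classification information; the function name `classify_segment`, the threshold value and the group labels are hypothetical and are not taken from the description.

```python
from typing import Dict, Set


def classify_segment(energies: Dict[str, float],
                     muted: Set[str],
                     active_threshold: float = 0.1) -> Dict[str, str]:
    """Assign each audio signal to one of three groups for one time segment.

    energies         -- signal energy per audio signal (assumed, normalized values)
    muted            -- signals set to permanently inactive by configuration
    active_threshold -- hypothetical energy level above which a speaker counts as active
    """
    groups = {}
    for name, energy in energies.items():
        if name in muted:
            # The configuration takes precedence over the measured energy.
            groups[name] = "muted"
        elif energy >= active_threshold:
            groups[name] = "active"
        else:
            groups[name] = "inactive"
    return groups


# AS4 carries audio but is muted by configuration.
groups = classify_segment({"AS1": 0.8, "AS2": 0.02, "AS3": 0.0, "AS4": 0.5},
                          muted={"AS4"})
```

In this sketch the configured muting overrides the measured energy, which mirrors the statement that the configuration itself counts as classification information.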

After it has been determined by the classification device KLASS, the group membership of the audio signals AS1 to AS4 or of the audio data flows ADS1 to ADS4 is transferred by means of the grouping information GI to the grouping unit GE, so that the latter can group the audio signals AS1 to AS4 in accordance with the grouping information GI. In addition, the classification device KLASS provides amplification factor information VI to the amplification device VE, so that an individual amplification factor value can be set for each group for use within the framework of signal processing. For example, it can be set for the group of active speakers that no signal processing by means of amplification or attenuation is to be undertaken, so that the audio signals of this group remain unchanged. By contrast, a uniform negative amplification across the group can be set for the group of inactive participants, for example a halving of the volume, so that sound signals which are to be regarded predominantly as disruptive noise are received more quietly. For the third group of muted participants, a uniform, very high attenuation can be defined for the group, so that no signals or only barely perceptible signals can be detected in the mixed signal after this signal processing has been applied.

The amplification device VE now applies preconfigured or dynamically determined group-specific amplification factors to the audio signals AS1 to AS4 based on the grouping information GI_(AS1) to GI_(AS4) transferred by the grouping unit GE, and thereby weights the audio signals AS1 to AS4 of the respective groups in accordance with their grouping. This weighting is undertaken individually for the respective audio signals AS1 to AS4 by means of individual signal processing. Afterwards, these weighted, signal-processed audio signals are processed by means of mixing or superimposition by the mixing device MIX into a number of audio conference signals, which after encoding by the encoding device KOD are fed as respective audio conference data flows KADS1 to KADS4 to the communication devices of the audio conference.
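The weighting and superimposition step can be illustrated with a short sketch. It assumes decoded audio signals held as lists of samples of equal length and linear gain factors per group; the names `GROUP_GAIN` and `mix_segment` and the concrete factor values are hypothetical, and a single mixed signal is formed for simplicity rather than one per participant.

```python
from typing import Dict, List

# Assumed linear gain factors per group: unchanged, halved, fully suppressed.
# The description only requires a "very high attenuation" for muted signals;
# 0.0 is a simplification.
GROUP_GAIN = {"active": 1.0, "inactive": 0.5, "muted": 0.0}


def mix_segment(signals: Dict[str, List[float]],
                groups: Dict[str, str],
                group_gain: Dict[str, float] = GROUP_GAIN) -> List[float]:
    """Weight each signal with the gain of its group, then superimpose all signals."""
    length = len(next(iter(signals.values())))
    mixed = [0.0] * length
    for name, samples in signals.items():
        gain = group_gain[groups[name]]  # uniform within a group
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


mixed = mix_segment(
    {"AS1": [1.0, -1.0], "AS2": [0.5, 0.5], "AS3": [1.0, 1.0]},
    {"AS1": "active", "AS2": "inactive", "AS3": "muted"},
)
```

A real conference device would typically form a separate mix for each receiving participant and encode it afterwards; the sketch only shows that signals of the same group receive the same weighting before mixing.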

The particular advantage of this method of operation is that audio conference contributions deemed to be important can be delivered unimpeded, or even amplified, to the audio conference data flows, whereas audio signals regarded as unimportant can be attenuated or filtered out. This method thus improves the clarity and comprehensibility of the speech in the mixed audio conference signals at the respective receiving communication devices.

As an alternative to the arrangement shown, an explicit grouping unit GE can be dispensed with (not shown). In this case the amplification device VE can evaluate the grouping information GI and the amplification factor information VI jointly and, based on this, undertake a group-specific amplification of the audio signals AS1 to AS4. Furthermore, a grouping unit GE can alternatively be arranged outside the audio path of the audio signals AS1 to AS4 (not shown), since a modification of the audio signals AS1 to AS4 is not absolutely necessary for implementing the invention.

Furthermore, as an alternative to the arrangement shown, a classification device KLASS can also use the audio data flows ADS1 to ADS4 directly as input signals, by contrast with the evaluation of the audio signals AS1 to AS4 explained above. In addition, it can be advantageous to provide both the audio data flows ADS1 to ADS4 and the decoded audio signals AS1 to AS4 jointly to the classification device KLASS, since signaling information in the audio data flows ADS1 to ADS4 can then be evaluated together with a signal analysis of the audio signals AS1 to AS4.

A classification in the classification device KLASS is undertaken not only by analyzing absolute values but in particular also by forming relative relationships between the audio signals AS1 to AS4 and/or by taking global framework conditions into account.

Forming relationships between the audio signals AS1 to AS4 is understood here, for example, as taking account of relative relationships between the audio signals AS1 to AS4. This is especially advantageous because, if all audio signals have a low level, a grouping into different groups can still be undertaken: the relatively loudest of the audio signals AS1 to AS4 is, for example, placed in a group of active speakers, whereas with a purely absolute assessment all audio signals might have been assigned to a common group.

The global framework conditions are in particular a maximum upper limit on the size of a group, whereby, if more audio data flows would be assigned to a group than it can include as group members, one or more of the audio data flows can be assigned to an alternate group.
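The combination of relative assessment and a maximum group size can be sketched as a rank-based assignment. This is one possible reading, not the method of the description: the function name `group_by_rank`, the group labels and the energy values are hypothetical, and the last group is assumed to be unlimited so that it can take up all remaining audio data flows.

```python
from typing import Dict, List, Tuple


def group_by_rank(energies: Dict[str, float],
                  capacities: List[Tuple[str, int]],
                  overflow_group: str) -> Dict[str, str]:
    """Fill the groups in order of priority with the relatively loudest signals.

    capacities     -- (group label, maximum number of members), highest priority first
    overflow_group -- assumed unlimited group for everything that does not fit
    """
    ranked = sorted(energies, key=energies.get, reverse=True)
    result = {}
    index = 0
    for group, capacity in capacities:
        for name in ranked[index:index + capacity]:
            result[name] = group
        index += capacity
    for name in ranked[index:]:
        result[name] = overflow_group
    return result


# All levels are low in absolute terms, yet the loudest signal still lands in K1.
assignment = group_by_rank(
    {"AS1": 0.01, "AS2": 0.03, "AS3": 0.02, "AS4": 0.005},
    capacities=[("K1", 1), ("K2", 1)],
    overflow_group="K3",
)
```

With a purely absolute threshold, all four signals in this example would probably fall into one common group; the ranking separates them even at low levels.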

A more detailed examination of the classification device KLASS is undertaken below with reference to FIG. 2.

Analysis device components of the classification device KLASS are depicted schematically in FIG. 2. Inputs of the classification device KLASS are once again the audio signals AS1, AS2 and further audio signals not depicted, for example AS3 and AS4. Different analysis device components are called for a respective audio signal AS1, AS2, . . . . These are in particular a signal energy determination unit SIGNE and an activity determination unit AKTE, which are provided for each audio signal. FIG. 2 further shows a priority determination unit PRIO for each audio signal, which, for the audio signal AS1 or AS2 to which it is assigned, takes account of a group preselection or a predetermined priority of the audio signal. Further components analyzing the respective audio signal are indicated by three dots below the components PRIOE, SIGENE and AKTE, which stand for components omitted from the diagram.

The results of the respective analysis devices are fed jointly for all audio signals AS1, AS2, . . . to an evaluation device BWE as the evaluation unit. Based on the information supplied by the analysis devices regarding priorities, the respective signal energy and the respective audio activity, this evaluation device BWE now determines the group to which an assigned signal belongs in a specific time segment. The result can be, for example, that the audio signal AS1 is assigned by the evaluation device BWE to a group of active speakers, whereas the audio signal AS2 is assigned to a group of inactive participants. The analysis is undertaken anew for each time segment, with analysis results of preceding time segments possibly also being included for a current time segment.

The information regarding the group membership is now transferred by the evaluation device BWE by means of the grouping information GI to the grouping unit GE, which is not shown in FIG. 2. In addition, the evaluation device BWE transfers group-specific amplification factor information VI to the amplification device VE, which is likewise not shown in the figure. The amplification factor information VI is influenced on the one hand by the group membership and on the other hand by the number of audio signals present at the classification device KLASS. It can thus make sense to provide different amplification factors depending on how many conference participants are taking part in the audio conference. For example, with a small conference there can simply be a division into two different amplification factors: an amplification of 0 dB for all active and less active participants of the audio conference, and a total muting by setting an amplification of −94 dB for completely inactive or muted participants. With a larger number of participants in an audio conference, it might on the other hand be sensible to undertake a more granular division of the amplification. In this case, for example, active speakers can continue to be processed unattenuated, with 0 dB amplification, whereas quiet speakers acting in the background experience, for example, a halving of their volume, and inactive participants who are merely partly active are attenuated by a factor of four.
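The decibel values and the verbal descriptions above can be related to each other numerically. The following sketch assumes that the amplification values refer to amplitude ratios, i.e. that a gain of g dB corresponds to a linear factor of 10^(g/20); the description does not state this convention, and the function names are hypothetical.

```python
import math


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor (assumed 20*log10 convention)."""
    return 10 ** (gain_db / 20)


def linear_to_db(factor: float) -> float:
    """Convert a linear amplitude factor to a gain in dB."""
    return 20 * math.log10(factor)


unity = db_to_linear(0.0)          # no change
halved_db = linear_to_db(0.5)      # halving of the amplitude, about -6 dB
quartered_db = linear_to_db(0.25)  # attenuation by a factor of four, about -12 dB
muted = db_to_linear(-94.0)        # about 2e-5, practically inaudible
```

Under this assumption, a halving corresponds to roughly −6 dB and an attenuation by a factor of four to roughly −12 dB, while −94 dB reduces the amplitude to about two hundred-thousandths of its original value.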

The number of groups can be configured or can be predetermined for the conference. Two possible embodiments are mentioned below as examples. In a first embodiment, a categorization into three groups is undertaken: the first group comprises active speakers of the audio conference, the second group background speakers and the third group muted or inactive participants of the audio conference. The distinction between active speakers and background speakers can in this case be made, for example, in accordance with a predetermined priority, but also by signal processing of the volume or of the signal energy for one or more time segments. A second possible embodiment is, for example, a division into a first active speaker as a first group, a second active speaker as a second group, further active speakers as a third group, background speakers as a fourth group and inactive or muted participants as a fifth group. With this type of granular grouping, switches between the groups may be possible without perceptible changes in the audio conference data flow, because the high granularity allows the amplification factors to be graduated in small steps.

A change between groups can be undertaken for an audio data flow in each time segment under consideration. In this case, however, a hysteresis can additionally be taken into account, through which a switch from one group to another is possibly delayed, in that a check is made as to whether the grouping into the further group has persisted over a number of time segments. The group preselection mentioned is, for example, a permanent assignment of an audio moderator to a group of active participants, so that the moderator can participate at full volume in the speech conference at any time. A prioritization of participants can, for example, be undertaken by means of configuration at a communication terminal or a data terminal, in particular via an application controlling the audio conference on a workstation computer. Preferably, a web page in a browser can be provided for controlling a conference, by means of which the individual participants can be allocated roles. For example, individual participants can be assigned a permanent inactivity, so that these people can participate in the conference only as listeners. Such an allocation of priorities can possibly also be changed dynamically by the moderator during the course of the audio conference.
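The hysteresis mentioned above can be sketched as a small state holder that accepts a new group only after it has been proposed in several consecutive time segments. The class name `HysteresisGrouper` and the parameter `hold_segments` are hypothetical; the description leaves the number of time segments open.

```python
class HysteresisGrouper:
    """Delay a group change until it has been proposed for several segments in a row."""

    def __init__(self, initial_group: str, hold_segments: int = 3):
        self.group = initial_group
        self.hold_segments = hold_segments  # assumed number of confirming segments
        self._candidate = None
        self._count = 0

    def update(self, proposed_group: str) -> str:
        """Feed the raw classification of one time segment, return the effective group."""
        if proposed_group == self.group:
            # The raw classification agrees with the current group: drop any candidate.
            self._candidate, self._count = None, 0
        elif proposed_group == self._candidate:
            self._count += 1
        else:
            self._candidate, self._count = proposed_group, 1
        if self._candidate is not None and self._count >= self.hold_segments:
            self.group = self._candidate
            self._candidate, self._count = None, 0
        return self.group


grouper = HysteresisGrouper("active", hold_segments=3)
raw = ["inactive", "inactive", "active", "inactive", "inactive", "inactive"]
effective = [grouper.update(g) for g in raw]
```

In the example, the short pause in the first two segments does not move the participant out of the active group; only three consecutive inactive segments do.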

Signal processing by means of attenuation or negative amplification of group members has the particular advantage that the participants who make an active contribution to the speech conference remain clearly perceptible, while other participants who merely produce disturbing noises can be slightly or possibly even heavily attenuated. However, since connecting and disconnecting participants with background noises would have an unpleasant effect for the listening conference participants, because background noises would appear or disappear from one moment to the next, it makes sense, for a switch from activity to complete inactivity, to let a participant pass in stages through a number of groups, with the respective groups being characterized by different attenuation factors. A participant can thus move, from time segment to time segment, step by step from an unattenuated group via a slightly attenuated group to a very heavily attenuated group. For the conference participants, this produces, after mixing, a mixed audio conference signal in which the background noise of one of the participants is slowly faded out.

If, on the other hand, a participant who was already muted should suddenly become active, the transition to an active group must be completed relatively quickly, since otherwise useful speech information of this participant would be lost. Such a behavior can be achieved, for example, by evaluating a filtered signal energy of the respective audio data flow (not shown in FIG. 2), with the filtering and/or smoothing being performed by a first-order FIR filter (FIR: Finite Impulse Response) with different filter coefficients for the rise and the fall of a signal edge. A comparison of the filtered audio signal with reference values can thus deliver a grouping into different groups. Only when the filtered audio signal has fallen below a specific threshold value, which because of the filtering may occur only in one of the following time segments, is an audio signal sorted back into a group which describes the inactivity of the participants.
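The asymmetric behavior for rising and falling signal edges can be illustrated with a smoothing routine that uses separate attack and release coefficients. Note that the sketch uses a one-pole recursive smoother, which is a common way of realizing such coefficients and is swapped in here for illustration; it is not necessarily the filter structure intended by the description. The function name `smooth_energy` and the coefficient values are hypothetical.

```python
from typing import List


def smooth_energy(energies: List[float],
                  attack: float = 0.9,
                  release: float = 0.2) -> List[float]:
    """Smooth a sequence of per-segment signal energies.

    attack  -- assumed coefficient used while the energy rises (close to 1: fast)
    release -- assumed coefficient used while the energy falls (close to 0: slow)
    """
    smoothed = []
    state = 0.0
    for energy in energies:
        coefficient = attack if energy > state else release
        state += coefficient * (energy - state)
        smoothed.append(state)
    return smoothed


# One loud segment followed by silence: the smoothed value decays over several segments.
decay = smooth_energy([1.0, 0.0, 0.0], attack=1.0, release=0.5)
```

Comparing the smoothed value with group thresholds then yields a fast promotion to an active group and a delayed return to an inactive group.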

Thus FIG. 2 could be expanded such that, based on the audio energy and the audio activity, a smoothed and/or filtered audio energy is determined by a further component and the classification information is determined on the basis thereof.

Different signal curves of audio input signals and classification information are now shown in the following figures.

FIG. 3 schematically shows functional sequences plotted on a time axis for three participants TLN1, TLN2, TLN3 of an audio conference. By way of example, a curve of the signal energy SIGE1, SIGE2, SIGE3, a function with classification information KLASS1, KLASS2, KLASS3 and a function of the audio signal amplification set V1, V2, V3 are specified for each participant TLN1, TLN2, TLN3.

The curve of a signal energy SIGE1 determined from an audio data flow of the first participant TLN1 is characterized in that no signals occur up to a point in time T7, whereas between the points in time T7 and T8 a signal energy other than zero occurs. Between the points in time T8 and T9 the audio data flow of the first participant TLN1 once more contains no speech information, so that the signal energy SIGE1 in this period is again zero. By contrast, at point in time T9 the first participant TLN1 becomes active again, which produces amplitudes in the signal energy curve SIGE1.

The participant TLN2 is characterized by extensive inactivity on their audio data flow, so that the curve of the signal energy SIGE2 is largely zero. Only in the time segments T1 to T3 and T5 to T6 does the curve of the signal energy SIGE2 show small peaks with low amplitude. This can be caused, for example, by quiet speech transmitted by means of the audio data flow or by the occurrence of background noises.

The participant TLN3 is permanently inactive with the exception of the time segment T2 to T4 and has a signal energy SIGE3 of zero. Only in the time segment T2 to T4 does the third participant TLN3 participate in the audio conference, which is indicated in the waveform of the signal energy SIGE3 by peaks of the curve.

For FIG. 3 let it be assumed that the audio conference device is configured such that only two amplification factor levels are provided. These are an amplification factor of 0 dB for an active speaker and an amplification factor of −30 dB for an inactive speaker or background speaker. These values are merely examples and can preferably be configured system-wide or individually. In this example a classification is undertaken into three groups K1, K2 and K3. The first group K1 represents an active speaker or a participant who is expected to have the highest probability of becoming active again. The second group K2 contains a participant who, in a given time segment, is either slightly active or was at least active at a previous point in time. The third group K3 represents a completely inactive participant, who possesses a lower weighting by comparison with the other audio conference participants.

Since in the present example only three audio conference participants TLN1, TLN2, TLN3 are participating in the audio conference, the maximum group strength of the first group K1 and of the second group K2 is set in each case to a single participant. The result is thus that an active participant who is assigned at a point in time to the first group K1 will possibly be re-sorted into the second group K2, although he continues to be active, if one of the other conference participants issues a louder speech signal and this results in a higher level of the respective signal energy curve.

Let the initial situation be that all three participants TLN1, TLN2, TLN3 are inactive. In this case let the basic state of the classification into the three groups K1, K2, K3 be such that the first participant TLN1 is presorted into the first group K1, whereas the second participant TLN2 is assigned to the second group K2. Let the third participant TLN3 be assigned to the third group K3 in the initial situation. This can, for example, correspond to a priority defined in advance. In accordance with this grouping, the original amplification factor for the audio data flow of the first participant TLN1 is set to 0 dB, while the amplification factor for the two further participants TLN2, TLN3 is set to −30 dB.

In the present exemplary embodiment the classification information corresponds to the level of a signal energy, as entered in the curves SIGE1, SIGE2, SIGE3. The detected items of classification information are related to each other in an evaluation which is not shown, so that a division into the groups K1 to K3 can be undertaken in accordance with the evaluation.

Since, as from point in time T1, speech signals other than zero are transmitted via the audio data flow of the second participant TLN2, which is detectable by means of the signal energy curve SIGE2, the second participant TLN2 is assigned to the group K1, because he is the only participant to fulfill the classification criterion for this group K1 of exceeding a certain threshold value of signal energy. The first participant TLN1 is then moved from his previous group K1 into the next group K2, because the maximum group strength of group K1 is one participant. The third participant TLN3 can remain in the group K3.

At point in time T2, in addition to the second participant TLN2, the third participant TLN3 now becomes an active speaker, with his speech signal energy level being for the most part clearly higher than the signal energy level of the second participant TLN2. A comparison of the signal energy curves SIGE2 and SIGE3 shows that the curve of the third participant TLN3 predominantly runs with greater amplitude than the curve of the second participant TLN2, with individual peaks of the signal energy curve SIGE2 exceeding the signal energy value of the third participant TLN3. In the sections in which the second participant TLN2 has the highest signal energy, this participant TLN2 is assigned to the highest group K1. In this case the active participant TLN3 is assigned to the second group K2 because of the maximum group strength of one. If the relationship is reversed, however, so that the third participant TLN3 has a higher signal energy than the second participant TLN2, the third participant TLN3 is assigned to the first group K1, whereas the second participant TLN2 is allocated to the second group K2. The completely inactive participant TLN1, on the other hand, is sorted into the lowest grouping level K3.

In FIG. 3 the subdivision into time segments for an analysis of the audio data flows or of the signal energy is shown with very fine granularity, so that the curves of the classification KLASS1, KLASS2, KLASS3 and of the amplification V1, V2, V3 appear continuous, although an evaluation of the classification information is actually undertaken only at discrete points in time, so that the analysis is performed only time segment by time segment.

In accordance with the classification of the participants TLN2 and TLN3 into the groups K1 and K2, the amplification factors are now also set according to the group classification. Thus the amplification factor V2 for the second participant TLN2 changes, depending on their grouping, between an amplification factor value of 0 dB and an amplification factor value of −30 dB. Analogously and reciprocally to the second participant TLN2, the amplification factors 0 dB and −30 dB are also set alternately for the participant TLN3, depending on their inclusion in the group K1 or K2.

After the speech signal of the second participant TLN2 has ended at point in time T3, only the third participant TLN3 is temporarily actively speaking. Thus the third participant TLN3 is assigned to the highest-priority group K1, whereas the second participant TLN2 is placed in the next available group K2. The participant TLN1 remains in the group K3, as in the previous time segment.

As from the point in time T4 none of the three conference participants is actively speaking. In the present exemplary embodiment this means that all participants stay in the previously allocated group. For the first participant TLN1 this is the group K3, for the second participant TLN2 the group K2 and for the third participant TLN3 the group K1. By contrast, in an alternate embodiment not shown, all participants TLN1, TLN2, TLN3 could be assigned to the third group K3 of inactive participants.

Following the time curve, three further time segments occur in FIG. 3, in each of which one participant becomes an active speaker whereas the other participants do not speak. In all three cases the reaction of the audio conference system is that the respective actively speaking participant is assigned to the group K1 and the participant previously assigned to the group K1 is sorted into the group K2. A participant already assigned to the group K3 remains in this group, and a participant assigned to the group K2 is assigned to the group K3, provided he is not speaking.
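One way to reproduce the regrouping behavior described for FIG. 3 is a move-to-front ordering: active participants are moved to the front of an ordered list, loudest first, and all others keep their relative order. This reading, the function names `regroup` and `assign_groups`, and the energy values are assumptions for illustration; the description does not prescribe this data structure.

```python
from typing import Dict, List

# Amplification per group as assumed for FIG. 3: 0 dB for K1, -30 dB otherwise.
GAIN_DB = {"K1": 0.0, "K2": -30.0, "K3": -30.0}


def regroup(order: List[str], energies: Dict[str, float],
            threshold: float = 0.0) -> List[str]:
    """Move currently active participants to the front, loudest first."""
    active = sorted((p for p in order if energies.get(p, 0.0) > threshold),
                    key=lambda p: energies[p], reverse=True)
    rest = [p for p in order if p not in active]  # silent participants keep their order
    return active + rest


def assign_groups(order: List[str]) -> Dict[str, str]:
    """Group strength of one for K1 and K2, everyone else goes to K3."""
    return {p: ("K1" if i == 0 else "K2" if i == 1 else "K3")
            for i, p in enumerate(order)}


order = ["TLN1", "TLN2", "TLN3"]                      # initial presorting
order = regroup(order, {"TLN2": 0.2})                 # from T1: TLN2 speaks quietly
after_t1 = assign_groups(order)
order = regroup(order, {"TLN2": 0.2, "TLN3": 0.6})    # from T2: TLN3 is louder
order = regroup(order, {})                            # from T4: nobody speaks
after_t4 = assign_groups(order)
order = regroup(order, {"TLN1": 0.5})                 # later: TLN1 becomes active
after_t7 = assign_groups(order)
gains = {p: GAIN_DB[g] for p, g in after_t7.items()}
```

When nobody speaks, the order is unchanged, which matches the embodiment in which all participants stay in their previously allocated group.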

FIG. 3 shows the way in which an evaluation or analysis of the classification information is undertaken and how signal processing of audio data flows can be made dependent on it. Because of the small number of participants in the example, the group strength has been set to one group member in each case; greater group strengths, at least for some of the groups, may be sensible in other implementations.

With reference to FIGS. 4 to 7, further temporal function sequences of the classification information and the amplification are now illustrated by means of function diagrams. A curve of the audio activity is omitted from these diagrams. FIGS. 4 to 7 also differ from FIG. 3 in that only the curves for one participant of the audio conference are shown and in that the respective time segments occupy a clearly recognizable segment on the time axis t.

In FIGS. 4 to 7 a classification is undertaken into four classes. One class represents the group of active speakers and is labeled ACT. A further group represents the background speakers of an audio conference and is labeled HG. A third group is labeled INACT and represents inactive participants of the speech conference. In addition, a fourth group MUTE exists which represents permanently muted participants. The grouping of a participant or their audio data flow into one of the corresponding categories is entered on the y-axis of the classification curve K. The x-axis represents a time axis t, with the classification being analyzed or evaluated only at discrete points in time.

Below the classification information curve K, an amplification V is plotted in a separate diagram, with the time axis t likewise being plotted on the x-axis of the diagram and corresponding to the time axis t of the classification curve K. Amplification factors, which are labeled G1, G2, G3 and G4 for FIGS. 4 to 7, are plotted on the y-axis. Let amplification factor G1 be an amplification of 0 dB, amplification factor G2 an amplification of −6 dB, amplification factor G3 an amplification of −15 dB and amplification factor G4 an amplification of −94 dB, with the negative amplification factors again being used for an attenuation of the audio signals of the conference. These amplification factor values are merely examples, however, and can be adapted depending on the implementation by system-wide static configuration or conference-individual settings.

FIG. 4 shows the curve of the classification K and the amplification V of a participant of an audio conference for an audio conference with few participants. Because of the small number of participants, the audio conference device is configured such that only two amplification factors can be set. These are the amplification factor G1 for a grouping into the groups ACT, HG and INACT and the amplification factor G4 for a grouping into the group of muted participants MUTE.

In the observation period, from time Start up to time End, the observed participant is now assigned to the groups ACT, HG, INACT, MUTE, depending on whether he is actively speaking, in particular in relation to a detected speech activity of further conference participants of the audio conference. Thus, for example, an assignment to the group of active speakers ACT is produced in a first time segment. In a second time segment, however, an assignment to the group of inactive speakers INACT is produced. Over the course of time the assignment switches between the groups in accordance with the speech activity of the participants. In addition, in a fourth time segment the observed participant switches from active to muted, which is evident from the assignment to the group MUTE in the classification curve K. This can occur, for example, when the participant actuates a key for muting the input microphone.

The amplification which results from the classification K and is applied to the audio signal of the participant is shown for this period in the curve of the amplification V. In accordance with the stated framework conditions, an amplification G1 is adopted for the groupings ACT, HG and INACT. Only in the fourth time segment, during which the participant is assigned to the group MUTE, is the amplification factor G4 used by the audio conference device for the present audio data flow. With the value of −94 dB stated here, this corresponds almost to a muting of the audio data flow. The amplification values G2 and G3 are not used in the present case of a conference with few participants, because a very granular distinction of the amplification factors does not appear necessary. By contrast, a finer division of the amplification factors is explained with reference to FIG. 5.

In FIG. 5 each grouping stage ACT, HG, INACT, MUTE is allocated exactly one amplification factor. Thus the amplification factor G1 is assigned to group members of the group ACT, and the amplification factor G2 to group members of the group HG. A corresponding assignment is made for the groups INACT and MUTE, to which the factors G3 and G4 respectively are assigned. In this case the curve of the amplification factor V, as can be seen from FIG. 5, coincides exactly with the curve of the classification information K.

FIGS. 6 and 7 now present further embodiments of the timing curve shown in FIG. 5. It should be noted in particular that an abrupt change of an amplification factor can possibly have negative effects on the voice quality for the communication user. This is why a softer transition between two amplification factors is explained using FIGS. 6 and 7. On a switch from a first group to a second group, the participant is, for a short time segment, not yet assigned to the second group but is briefly managed without group membership. This is indicated by a dotted line in the curve K. During this time the amplification factor can be varied continuously and uniformly between a start amplification factor and a target amplification factor. Thus a continuous curve is produced in the curve V in FIG. 6, with, for example, a direct straight connection existing between two amplification factor values, along which the amplification factors are varied. This produces a continuous curve of the amplification factors, which has an advantageous effect on the speech quality of the audio conference.

FIG. 7 shows a similar embodiment, which differs from that depicted in FIG. 6 in that, during a transition from one amplification factor to another, the amplification factor is varied in discrete steps. The restriction to discrete amplification factor values can reduce the complexity of the amplification matching.
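The two transition variants of FIGS. 6 and 7 can be sketched as follows. The sketch assumes that the amplification is interpolated on the dB scale and that the discrete variant passes through the configured values G1 to G4; both are assumptions, as are the function names `linear_ramp` and `stepped_ramp`.

```python
from typing import List

LEVELS_DB = [0.0, -6.0, -15.0, -94.0]  # example values G1 to G4 from the description


def linear_ramp(start_db: float, target_db: float, steps: int) -> List[float]:
    """Straight-line transition in `steps` equal increments, ending at the target."""
    return [start_db + (target_db - start_db) * k / steps
            for k in range(1, steps + 1)]


def stepped_ramp(start_db: float, target_db: float,
                 levels_db: List[float] = LEVELS_DB) -> List[float]:
    """Transition that visits only the allowed discrete levels between start and target."""
    low, high = sorted((start_db, target_db))
    between = [g for g in levels_db if low <= g <= high]
    path = sorted(between, reverse=(target_db < start_db))
    return [g for g in path if g != start_db]


smooth = linear_ramp(0.0, -6.0, steps=3)   # FIG. 6 style
fade_out = stepped_ramp(0.0, -94.0)        # FIG. 7 style, towards attenuation
fade_in = stepped_ramp(-94.0, 0.0)         # FIG. 7 style, towards amplification
```

The number of steps, or the time spent on each discrete level, could be chosen differently for fading out and fading in, so that attenuation proceeds more slowly than a return to full volume.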

In addition, it is possibly advantageous to perform an amplification matching over time periods of different lengths, depending on whether a jump between directly adjacent amplification factors is involved, e.g. a switch from −6 dB to −15 dB for a graduation of −6, −15, −94 dB, or whether a major change of an amplification factor is involved, e.g. a switch from −6 dB to −94 dB. It can also be taken into account whether the change is in the direction of an attenuation or in the direction of an amplification, whereby it can be advantageous for the resulting audio quality to make a change of the amplification factor in the direction of an attenuation more slowly than a change in the direction of a positive amplification. In this way a homogeneous speech image can be created and, despite this, a speech channel can be switched on quickly if a participant suddenly becomes an active speaker.

The division into homogeneous groups in accordance with a classification is advantageous insofar as it can reduce the complexity of the audio conference device. This is especially the case if, for inactive groups, the audio conference data flows to be output to the communication terminals whose audio data flows are assigned to those groups are formed from group-conformant, superimposed, signal-processed audio data flows, so that for all participants assigned to a group the encoding and the superimposition need to be undertaken only once, and the results of the encoding and the superimposition can be made available to all participants of the group.

Preferably the classification or grouping and the amplification behavior for the respective groups can be made dependent on the size of the audio conference. It can thus be defined, for example by means of preconfigured tables, how many groups are to be formed for how many conference participants. This can achieve the result that, for conferences with 3 participants, all participants are sorted into one group, for conferences with 4-8 participants three groups are available, and for 9 or more participants five groups. Preferably the transition occurs dynamically over the course of time, so that when a further participant is accepted into a conference with 8 participants there is a transition from a division into three groups to a division into five groups.
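Such a preconfigured table can be sketched as a simple lookup. The table contents follow the example figures above; the names `GROUP_TABLE` and `groups_for`, and the treatment of conferences with fewer than 3 participants like those with exactly 3, are assumptions.

```python
# (maximum number of participants, number of groups); larger conferences use five groups.
GROUP_TABLE = [(3, 1), (8, 3)]
DEFAULT_GROUPS = 5


def groups_for(participants: int) -> int:
    """Look up how many groups are to be formed for a conference of the given size."""
    for max_participants, group_count in GROUP_TABLE:
        if participants <= max_participants:
            return group_count
    return DEFAULT_GROUPS


before = groups_for(8)  # three groups
after = groups_for(9)   # a ninth participant joins: five groups
```

Calling the lookup again whenever a participant joins or leaves gives the dynamic transition between divisions described above.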

In a similar way, the amplification factor values can preferably also be adapted dynamically as a function of the number of conference participants, so that for a grouping into three groups for 4-8 participants, different amplification factor values are used for 4-5 participants than for 6-8 participants.
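A preconfigured table of this kind could look like the following sketch. The group counts follow the example in the text, with nine or more participants assumed to get five groups, in line with the transition described for the ninth participant. The table names and all gain values other than the group counts are hypothetical and serve only to show the lookup.

```python
# (maximum number of participants, number of groups); counts from the example.
GROUP_COUNT_TABLE = [(3, 1), (8, 3)]
DEFAULT_GROUP_COUNT = 5  # assumed for nine or more participants

# (maximum number of participants, gain per group in dB); values are hypothetical.
GAIN_TABLE_DB = [
    (3, (0.0,)),
    (5, (0.0, -6.0, -94.0)),
    (8, (0.0, -15.0, -94.0)),
]
DEFAULT_GAINS_DB = (0.0, -6.0, -15.0, -30.0, -94.0)


def group_count(participants):
    """Look up how many groups to form for a conference of the given size."""
    for max_participants, groups in GROUP_COUNT_TABLE:
        if participants <= max_participants:
            return groups
    return DEFAULT_GROUP_COUNT


def group_gains_db(participants):
    """Look up the per-group gain values for a conference of the given size."""
    for max_participants, gains in GAIN_TABLE_DB:
        if participants <= max_participants:
            return gains
    return DEFAULT_GAINS_DB


if __name__ == "__main__":
    for n in (3, 4, 6, 8, 9):
        print(n, group_count(n), group_gains_db(n))
```

Re-evaluating both lookups whenever a participant joins or leaves gives the dynamic transition between group divisions.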

The invention is further especially advantageous in that all participants of an audio conference can also be sorted into just one single group, for example a group of active speakers. In this way the result can preferably be achieved, for conferences with few participants, that all participant audio signals are mixed, with all audio signals undergoing the same signal processing or no signal processing. This produces compatibility with existing systems on the one hand and lower complexity for conferences with few participants on the other. In addition, as mentioned above, the number of groups is increased when a predetermined number of conference participants is exceeded.

In one embodiment of the invention, a damped and/or smoothed signal energy of one of the audio data flows can preferably be determined as one of the items of classification information by filtering the audio data flow with a Finite Impulse Response filter, a so-called FIR filter. For example, a lowpass filter applied to the signal energy can achieve a slower behavior as regards the regrouping of the conference participants. As an alternative or in addition, a first-order FIR filter can for example be employed, preferably with different so-called attack and release coefficients, so that a switch into a higher category with smaller attenuation is undertaken more quickly than vice versa, since an FIR filter allows a slow falling off of the signal energy over a number of time segments.
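As a rough illustration of smoothing the signal energy with a lowpass FIR filter, the sketch below applies a moving average (an FIR filter with equal tap weights) to per-segment energies. The class name, the tap count of four and the mean-square energy measure are assumptions made for the example, and the separate attack and release coefficients mentioned above are not modeled.

```python
from collections import deque


def frame_energy(samples):
    """Mean-square energy of one time segment of samples."""
    return sum(s * s for s in samples) / len(samples)


class SmoothedEnergy:
    """Moving-average FIR lowpass over the energies of the last `taps` segments."""

    def __init__(self, taps=4):
        # Start with silence so the filter has a full history window.
        self.history = deque([0.0] * taps, maxlen=taps)

    def update(self, samples):
        """Add one time segment and return the smoothed energy."""
        self.history.append(frame_energy(samples))
        return sum(self.history) / len(self.history)


if __name__ == "__main__":
    smoother = SmoothedEnergy(taps=4)
    loud, silent = [1.0, -1.0], [0.0, 0.0]
    for segment in (loud, loud, loud, loud, silent, silent):
        print(smoother.update(segment))
```

Because the smoothed value falls off only gradually after a participant stops speaking, a classification based on it regroups that participant with a delay rather than at once.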

The embodiments explained in accordance with FIGS. 1-7 are especially advantageous in that a volume adaptation or an ongoing signal processing can be performed dynamically, so that the speech comprehensibility for the participants of the audio conference is enhanced. In addition, the complexity can be kept low because of the grouping of the audio data flows and the consideration of only a few groups. Furthermore, the processing complexity in an audio conference device can be reduced with the procedure explained with reference to the following figures, since the number of simultaneously used CODECs can be reduced. How CODECs can be saved in such cases is explained below.

FIGS. 8-12 show schematic block diagrams to illustrate different procedural states of the switchover method within the framework of an audio conference. The audio conference is shown by way of example for five participants with their communication terminals EG1, EG2, EG3, EG4, EG5. Each of the communication terminals EG1, EG2, EG3, EG4, EG5 in this case comprises a decoder D1, D2, D3, D4, D5 for converting received audio data AD1, AD2, AD3, AD4, which is transferred from an encoding device KOD of an audio conference device with its encoders K1, K2, K3, K4. The communication terminals EG1, EG2, EG3, EG4, EG5 are in this case for example speech terminals, such as telephones or telephony applications on a workstation computer, each of which additionally features an encoder (not shown) in order to create audio data from speech signals picked up by means of a microphone and make them available to the audio conference device in packet-oriented or circuit-switched form.

The audio conference device possesses a decoding device, not shown in the figures, for converting the audio data provided by the communication terminals EG1, EG2, EG3, EG4, EG5 into audio signals AS, and a mixing device, merely indicated by the “+” sign, for mixing or superimposing these audio signals AS. The previously mentioned classification of the communication terminals EG1, EG2, EG3, EG4, EG5, or of their audio data or audio signals AS, into homogeneous groups depending on the audio activity of the participants may also be undertaken. Furthermore, the audio signals may be weighted beforehand or changed in their signal waveform by means of signal processing, for example attenuated or amplified (not shown). The mixing device produces mixed audio signals MIXA, MIXB, MIXC, MIXD, some of which are formed specifically for output to one of the communication terminals EG1, EG2, EG3, EG4, EG5 and some jointly for output to a number of the communication terminals EG1, EG2, EG3, EG4, EG5.

As regards the nomenclature, a subscript on the reference symbol “AS” of an audio signal gives the label of the communication terminal EG1, EG2, EG3, EG4, EG5 from which the respective audio signal originates.

A state representing a set of values of encoder parameters of one of the encoders K1, K2, K3, K4 is labeled as ZA, ZB, ZC, ZD, with the currently active state ZA, ZB, ZC, ZD in FIGS. 8-11 being specified for the respective encoder K1, K2, K3, K4 as a subscripted suffix, e.g. K1_(ZA), K2_(ZB). Encoding parameters that influence a state are to be understood here as including parameters for a synthesis of sounds on the one hand, but especially also intermediate results of a computation within the framework of an encoding process.

Encoding parameters are not shown further in the figures and are for example one or more tables of setting parameters for a CODEC. In the exemplary embodiment, a set of values of all table entries of all encoder parameters, including the intermediate results of the encoding computation, is referred to as a state, with a change of at least one table entry or of one intermediate result being designated as a change of state.

A state corresponding to a state of an encoder, i.e. a set of values of encoder parameters, is thus also produced for the decoders D1, D2, D3, D4, D5 of the communication terminals EG1, EG2, EG3, EG4, EG5. Here too the state is specified as a subscripted suffix for the decoder reference symbol, with a decoder state corresponding to an encoder state being indicated with a prime after the state reference symbol. That is, the decoder D1, which is connected to the encoder K1, which in turn has assumed the state ZA and is thus labeled as encoder K1_(ZA), is labeled as decoder D1_(ZA′).

The encoders and decoders are preferably embodied such that they allow analysis values from further back in time to be included in the analysis of a current speech segment. In one embodiment the encoders and decoders use a CELP method (CELP: Code-book Excited Linear Predictive Coding). An example would be a CODEC in accordance with ITU Recommendation G.728 (ITU: International Telecommunication Union).

A state represents for example a stored table of encoding parameters and intermediate results of the encoding computation, which has been produced as a result of an analysis of previous audio signal time segments and is used for an improved encoding/decoding of a current audio signal time segment. A loss of these encoder parameters and/or intermediate results necessary for the respective CODEC, or a non-algorithm-conformant variation of these values, would in such cases have a negative and usually perceptible effect on the audio signals created for output at a communication terminal, since these encoder parameters and intermediate results were introduced precisely in order to achieve, while further reducing the data to be transmitted, a better speech quality than can be achieved with the same transmission bandwidth without using historical encoder parameters.

In FIGS. 8-12 the connections between encoders and decoders are shown as lines between these components, with the direction of transmission being indicated by arrowheads. These connections can in such cases be based on packet-oriented and/or circuit-switched principles.

FIG. 8 represents the initial situation for all of the following figures. In an established audio conference between the communication terminals EG1, EG2, EG3, EG4, EG5, let the participants of the communication terminals EG1, EG2, EG3 be categorized as active speakers, whereas the participants of the communication terminals EG4, EG5 are simply listeners. Let the maximum group strength of the group of active speakers be 3, for example, for FIGS. 8-10. Let the group of those simply listening be unlimited. An individually mixed conference signal is formed for each of the communication terminals EG1, EG2, EG3 of the active speakers, in which the speech component of the participant of the communication terminal for which the mixed audio signal is intended is filtered out (not shown). The individual mixed audio signal for the first communication terminal EG1 is MIXA, for the second communication terminal EG2 MIXB and for the third communication terminal EG3 MIXC. The mixed audio signal MIXA in this case is preferably a superimposing of the picked-up audio signals provided by the communication terminals EG2 and EG3. The mixed audio signal MIXB is preferably a superimposing of the picked-up audio signals of the communication terminals EG1 and EG3, while the mixed audio signal MIXC is a superimposing of the picked-up audio signals AS_(EG1) and AS_(EG2) of the communication terminals EG1 and EG2. In addition, a superimposing of all audio signals of all active participants is formed, i.e. AS_(EG1)+AS_(EG2)+AS_(EG3), where a “+” in this nomenclature is interpreted as a superimposing operation, with the superimposed mixed audio signal labeled as MIXD.

The mixed audio signal MIXA is fed to the encoder K1, so that this encoder, at a specific point in time, has encoder parameters in accordance with state ZA. Similarly, a state ZB is produced for the encoder K2 through application of the mixed audio signal MIXB, a state ZC for the encoder K3 through application of the mixed audio signal MIXC, and a state ZD for the encoder K4 through application of the mixed audio signal MIXD. The encoders K1, K2, K3, K4 create the audio data flows AD1, AD2, AD3, AD4, with the numbering corresponding to the encoders K1, K2, K3, K4. The audio data flows AD1, AD2, AD3 are now each individually transferred to the communication terminals EG1, EG2, EG3, whereupon the respective decoders D1, D2, D3 perform a decoding and adopt the decoding states ZA′, ZB′, ZC′ associated with the respective states ZA, ZB, ZC.

The mixed audio signal MIXD as a superimposition of the audio signals AS_(EG1)+AS_(EG2)+AS_(EG3) is fed to the encoder K4, which subsequently adopts the state ZD representing its encoder parameters. Audio data AD4 generated by the encoder K4 is now supplied to the two communication terminals EG4 and EG5, with their individual decoders D4 and D5 each adopting the same decoding state ZD′.
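The formation of the mixed signals in the initial situation of FIG. 8 can be sketched as follows. The representation of audio signals as lists of sample values, the function names and the sample data are assumptions for illustration only; weighting and other signal processing are omitted.

```python
def superimpose(signals):
    """Superimpose equally long signals by adding them sample by sample."""
    return [sum(samples) for samples in zip(*signals)]


def build_mixes(audio_signals, active):
    """Return one individual mix per active speaker and one shared listener mix.

    Each individual mix leaves out the speaker's own signal; the shared mix
    (MIXD in the description) contains all active speakers.
    """
    individual = {
        terminal: superimpose(
            [audio_signals[other] for other in active if other != terminal]
        )
        for terminal in active
    }
    shared = superimpose([audio_signals[terminal] for terminal in active])
    return individual, shared


if __name__ == "__main__":
    # Hypothetical two-sample signals for the three active speakers.
    signals = {"EG1": [1, 0], "EG2": [0, 2], "EG3": [4, 4]}
    mixes, mixd = build_mixes(signals, ["EG1", "EG2", "EG3"])
    print(mixes["EG1"], mixes["EG2"], mixes["EG3"], mixd)
```

In this arrangement five terminals are served by four mixed signals and therefore four encoders, because the listeners EG4 and EG5 share the mix of all active speakers.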

With reference to FIG. 9, and starting from the system state shown in FIG. 8, a change in the speech activity of the participant of the communication terminal EG2 will now be explained, with the participant of the communication terminal EG2, previously viewed as active, now being regarded as inactive and being assigned to a corresponding group of inactive participants. There is now the opportunity, as with the merging of the two previously inactive participants of the communication terminals EG4 and EG5, to also supply the new inactive participant of the communication terminal EG2 with jointly created audio data. Without application of the switchover method, however, an abrupt, direct switchover of the decoder input of the decoder D2 to the output of the encoder K4 is only possible with degradations in the speech quality, since the state ZD of the encoder K4 differs from the state ZB of the encoder K2, and the state ZB′ of the decoder D2 likewise does not correspond to the state of the encoder K4.

By means of an embodiment of the method, the state ZB of the encoder K2, and thus also the state ZB′ of the decoder D2, is now changed so that the state ZB approaches the state ZD and the state ZB′ approaches the state ZD′. Once these pairs of states match, the output signal of the encoder K4 can be fed to the input of the decoder D2 without perceptible losses in quality occurring.

As shown in FIG. 9, the same mixed audio signal which is fed with the label MIXD to the encoder K4 is likewise fed to the encoder K2, starting from one time segment and for all subsequent time segments. At first the two encoders K2 and K4, because of their stored encoder parameters which have been produced from the audio signal curves of previous time segments, possess differing states ZB and ZD. If however it is now assumed that, for a CODEC such as the encoders K2 and K4, time segments lying further back have a far smaller influence on the encoder parameters than a current or just elapsed time segment, the result is that the encoder parameters and thus the state ZB of the encoder K2 approach the values of the encoder parameters of the encoder K4, until at a future time segment the encoding parameters match exactly or, taking tolerances into account, largely match, and thus a match between the states ZB and ZD of the encoders K2 and K4 arises.

This is fulfilled in the time segment on which FIG. 10 is based. In this time segment state ZB of the encoder K2 has approached the state ZD assumed in the same time segment by encoder K4, so that a switchover of the input of the decoder D2 to the output of the encoder K4 is possible without quality problems. In the current or in a future time segment the audio connection to communication terminal EG2 is now switched over such that a switch is made from encoder K2 as original source of the audio connection to encoder K4. The communication terminal EG2 and thus the decoder D2 thus receive the audio data AD4 fed via the audio data connection, precisely like the communication terminals EG4 and EG5. The state adopted by the decoder D2 also matches the respective state ZD′ of the decoders D4 and D5.

In order to save on computing effort and encoder resources in the encoding device KOD, the encoder K2 can now be deactivated, released or removed. Feeding of the mixed signal MIXB can thus likewise be ended. Both are indicated in FIG. 10 by striking through the reference symbols MIXB and K2.
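The convergence of two encoder states that are fed the same mixed signal can be illustrated with a deliberately simple toy model. The model is an assumption made for the example and is not a CELP or G.728 encoder: the state is a single number in which older time segments count for less than recent ones, which is the only property the switchover relies on. All names and values are hypothetical.

```python
class ToyEncoder:
    """Toy stand-in for an encoder whose state depends on past time segments."""

    def __init__(self, state=0.0, memory=0.5):
        self.state = state
        self.memory = memory  # weight of the history; older segments fade out

    def encode(self, level):
        """Update the state from one time segment of the input signal."""
        self.state = self.memory * self.state + (1.0 - self.memory) * level
        return self.state


def segments_until_match(k2, k4, mix_levels, tolerance=1e-3):
    """Feed the same mix to both encoders until their states match.

    Returns the number of time segments needed, or None if the states
    never come within the tolerance.
    """
    for count, level in enumerate(mix_levels, start=1):
        k2.encode(level)
        k4.encode(level)
        if abs(k2.state - k4.state) <= tolerance:
            return count
    return None


if __name__ == "__main__":
    k2 = ToyEncoder(state=1.0)  # plays the role of state ZB
    k4 = ToyEncoder(state=0.0)  # plays the role of state ZD
    needed = segments_until_match(k2, k4, [0.3] * 20)
    print("switch connection to K4 and release K2 after", needed, "segments")
```

In the toy model the state difference halves with every time segment, so the two states match within the tolerance after a bounded number of segments, at which point the connection can be switched and the redundant encoder released.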

The switchover method explained above is especially advantageous for dynamically assembled encoding devices, in which encoders are dynamically allocated to the audio signals of the audio conference and thus can also be dynamically released again. In this way it may be possible to save on one encoder by switching over to an alternate encoder. Saving on or deactivating an encoder is advantageous inasmuch as the processing effort in the encoding device can be reduced in this way, especially when complex CODECs are used which make high demands on computing power.

FIG. 11 illustrates a further embodiment of the switchover method starting from the procedural state adopted in FIG. 8. Let the maximum group strength of the group of active speakers be 4, for example. Let the group of those just listening be unlimited. In this case the participant of the communication terminal EG5 becomes active and therefore needs the generation of a specific mixed audio signal, in which all audio components of the other communication terminals EG1, EG2, EG3, EG4 are superimposed. Thus in the encoding device KOD a new encoder K5, provided specifically for an audio connection to communication terminal EG5, is generated or activated, to which a mixed signal MIXE with a superimposition of the audio signals AS_(EG1)+AS_(EG2)+AS_(EG3)+AS_(EG4) is fed. Because the encoder K5, and thereby also a new state ZE, is newly created, it is shown as a dashed-line rectangle in FIG. 11, by contrast with the encoders K1 to K4 shown as solid rectangles.

If now only the encoder K5 were created, without its encoder parameters and thus its state being adapted, a discontinuity of the decoder parameters would be produced for decoder D5, which would have the effect of reducing the speech quality or producing decoding errors. In order to avoid this, the method described below ensures that the state of the decoder D5, and thus its decoding parameters, continues to change continuously although the audio connection in progress at decoder D5 is abruptly switched over from encoder K4 to encoder K5.

This is achieved in that, after the mixed signal MIXE and the encoder K5 have been created, the encoder parameters and thus the state ZD of the encoder K4 are detected and set for the encoder K5 in the same time segment. This is preferably done by means of a copying process CP, indicated in FIG. 11 by an arrow from encoder K4 to encoder K5. Encoder K5 thus assumes the state ZD without delay and encodes the incoming mixed signal MIXE on the basis of this state. Although this means that the encoding process of the encoder K5 begins suddenly, this discontinuous behavior is not perceived at the decoder D5 of the communication terminal EG5, provided a switchover of the audio connection is likewise performed in the same time segment, so that the audio data AD5 generated by encoder K5 is fed to the decoder D5. This is shown in FIG. 12. The decoder D5 has the state ZD′ at the switchover point of the audio connection. Since this corresponds to the state ZD of the encoders K4 and K5, the decoding process is not disturbed by a switchover from encoder K4 to encoder K5, so that no perceptible errors in the decoding by the decoder D5 occur. Because of the final switchover to K5 and the connection to decoder D5, the encoder K5 is now shown as a solid-line rectangle in FIG. 12.

The state ZD of the encoder K5 and the state ZD′ of the decoder D5 assumed in FIG. 12 only apply at the switchover point. In the following time segments the encoder K5 can by contrast assume specific states, depending on the mixed signal MIXE. The decoder D5 will assume the corresponding different states, likewise dependent on the state of the decoder D4.
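The effect of the copying process CP can be illustrated with a toy predictive encoder/decoder pair. This model is an assumption made for the example and is not a real CODEC: the encoder transmits only the difference between the input level and its state, and the decoder adds its own state back, so decoding is correct only while both states agree. All names and signal levels are hypothetical.

```python
class ToyEncoder:
    """Toy predictive encoder: transmits the input minus its current state."""

    def __init__(self, state=0.0):
        self.state = state

    def encode(self, level):
        residual = level - self.state
        self.state = 0.5 * self.state + 0.5 * level
        return residual


class ToyDecoder:
    """Toy decoder that mirrors the encoder's state update."""

    def __init__(self, state=0.0):
        self.state = state

    def decode(self, residual):
        level = residual + self.state
        self.state = 0.5 * self.state + 0.5 * level
        return level


def switch_to_new_encoder(copy_state):
    """Serve D5 from K4, then switch D5 to a newly created K5.

    Returns what D5 decodes for the first segment of the new mix (level 0.4).
    """
    k4, d5 = ToyEncoder(), ToyDecoder()
    for level in (0.8, 0.6, 0.9):  # hypothetical levels of the shared mix
        d5.decode(k4.encode(level))
    # Copying process: K5 starts from K4's state, or from scratch without it.
    k5 = ToyEncoder(state=k4.state if copy_state else 0.0)
    return d5.decode(k5.encode(0.4))


if __name__ == "__main__":
    print("with state copy:   ", switch_to_new_encoder(copy_state=True))
    print("without state copy:", switch_to_new_encoder(copy_state=False))
```

With the state copied, the decoder reconstructs the first segment of the new mix correctly even though the source encoder changed abruptly. Without the copy, the mismatch between encoder and decoder state produces a clearly wrong value in this toy model.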

In combination with the process explained with reference to FIGS. 8-12, encoders of an encoding device can be dynamically switched off and switched on, with other encoders taking over the encoding tasks and audio connections being switched between the encoders and the decoders of communication terminals.

As well as classical audio conferences, a use in other telephone services is also conceivable in which a number of participants simultaneously receive sometimes the same and sometimes different audio signals, and in which the audio signals are sometimes switched. Examples are announcement services in which a plurality of participants are played a recorded announcement, for example a promotional message, or are played music-on-hold. In this case a common signal can be temporarily transmitted to a number of participants via a common encoder, with, for example, a participant-specific encoder being activated and the audio connection switched over to this encoder when a participant is switched through to a call center agent. Thus a reduction in the number of simultaneously active encoders can be achieved for the reproduction of similar recorded announcements and tone sequences.

The invention claimed is:
 1. A method for carrying out an audio conference, comprising: an audio conference device receiving a plurality of audio data flows from communication devices of conference participants; the audio conference device evaluating classification information for a first time segment provided in the audio data flows; the audio conference device determining a number of groups to which the audio data flows are to be assigned based on the received audio data flows, the groups comprising at least a first group of the audio data flows received from the communication devices, a second group of the audio data flows received from the communication devices, and a third group of the audio data flows received from the communication devices; for each of the audio data flows, the audio conference device determining which of the groups that the audio data flow is to be assigned for the first time segment such that the audio data flows assigned to the first group have a homogenous result from the evaluation of the classification information of the first time segment of the audio data flows and the audio data flows assigned to the second group have a homogenous result from the evaluation of the classification information of the first time segment of the audio data flows and the audio data flows assigned to the third group have a homogenous result from the evaluation of the classification information of the first time segment of the audio data flows; the audio conference device processing the first time segment of the received audio data flows with group specific attenuation such that the audio data flows assigned to the first group are attenuated differently than the audio data flows assigned to the second group for the first time segment and the audio data flows assigned to the second group are attenuated differently than the audio data flows assigned to the third group for the first time segment and the audio data flows assigned to the third group are attenuated differently than the audio data flows assigned to the first group for the first time segment; the audio conference device mixing or superimposing the processed received audio data flows to form audio conference data flows; the audio conference device sending the formed audio conference data flows to the communication terminals; and wherein the processing of the audio data flows assigned to the first group, second group, and third group occurs such that a volume of the audio data flows of the first group is higher than a volume of the audio data flows assigned to the second group and the volume of the audio data flows assigned to the second group is higher than a volume of the audio data flows assigned to the third group; and wherein the determining of the number of groups and the determining of which of the groups that the audio data flows are to be assigned for the first time segment is performed by a classification device of the audio conference device and wherein the processing of the first time segment of the received audio data flows is performed by an amplification device of the audio conference device; and for each group, the amplification device attenuating the audio data flows with group specific attenuation based on a number of total audio data flows within the group.
 2. The method as claimed in claim 1 wherein the processing of the audio data flows assigned to the second group occurs such that these audio data flows are attenuated or amplified such that the volume of the audio data flows assigned to the second group is less than the volume of the audio data flows assigned to the first group.
 3. The method as claimed in claim 1, wherein the evaluation of the classification information includes an assessment of a predeterminable group preselection identifier individually assigned to one of the audio data flows so that an assignment to a preselected group is forced depending on the group preselection identifier.
 4. The method as claimed in claim 1, wherein the evaluation of the classification information includes an assessment of the number of audio data flows to be assigned to a group, and by this number being exceeded in relation to a maximum threshold value of audio data flows assigned to a group, an assignment into an alternate group is forced.
 5. The method as claimed in claim 1, further comprising: the classification device sending group information to a grouping unit of the audio conference device, the grouping unit impressing the first time segment of the audio data flows with grouping information based upon the group information received from the classification device; and the classification device sending amplification information to the amplification device such that the amplification device processes the first time segment of the audio data flows based upon the grouping information impressed by the grouping unit, the amplification device applying a first amplification factor for the audio data flows assigned to the first group identified via the amplification information and a second amplification factor for the audio data flows assigned to the second group identified via the amplification information, the first amplification factor differing from the second amplification factor.
 6. The method as claimed in claim 1, further comprising: the classification device sending group information and amplification information to the amplification device such that the amplification device processes the first time segment of the audio data flows based upon the grouping information and the amplification information, the amplification device applying a first amplification factor for the audio data flows assigned to the first group identified via the amplification information and grouping information and a second amplification factor for the audio data flows assigned to the second group identified via the amplification information and grouping information, the first amplification factor differing from the second amplification factor.
 7. The method as claimed in claim 1, wherein the groups are also comprised of a fourth group of audio data flows and a fifth group of audio data flows, the audio data flows assigned to the fifth group being muted.
 8. The method as claimed in claim 1, wherein the audio data flows of the first group, the second group, and the third group are each attenuated to define a transition between the first group and the second group and to define a transition between the second group and the third group.
 9. The method as claimed in claim 1, further comprising: the audio conference device evaluating classification information for a second time segment provided in the audio data flows; for each of the audio data flows, the audio conference device determining which of the groups that the audio data flow is to be assigned for the second time segment such that the audio data flows assigned to the first group have a homogenous result from the evaluation of the classification information of the second time segment of the audio data flows and the audio data flows assigned to the second group have a homogenous result from the evaluation of the classification information of the second time segment of the audio data flows and the audio data flows assigned to the third group have a homogenous result from the evaluation of the classification information of the second time segment of the audio data flows; and the audio conference device processing the second time segment of the received audio data flows such that the audio data flows assigned to the first group for the second time segment are processed differently than the audio data flows assigned to the second group for the second time segment and the audio data flows assigned to the second group are processed differently than the audio data flows assigned to the third group for the second time segment and the audio data flows assigned to the third group are processed differently than the audio data flows assigned to the first group for the second time segment.
 10. The method as claimed in claim 9, wherein one of the audio data flows assigned to the third group for the first time segment is assigned to the first group or the second group for the second time segment based upon an evaluation of a filtered signal energy of that audio data flow.
 11. The method as claimed in claim 9, wherein the evaluation of the classification information of the audio data flows for the second time segment is carried out with an inclusion of the evaluation of the classification information of the audio data flows from the first time segment.
 12. The method as claimed in claim 9, wherein one of the audio data flows assigned to the third group in the first time segment is assigned to the second group or the first group in the second time segment and upon a detection of a changing of group assignment for that audio data flow, an individual attenuation to that audio data flow is provided starting from an attenuation provided for the third group that is changed in accordance with a monotonous function constantly or is changed in accordance with the monotonous function in discrete steps until that audio data flow is attenuated at a second attenuation of the first group or second group.
 13. The method as claimed in claim 9, wherein one of the audio data flows assigned to the third group in the first time segment is assigned to the second group or the first group in the second time segment and wherein a detection of the changed assignment for that audio data flow occurs only after a plurality of time intervals pass in which that audio data flow is assigned to the second group or the first group.
 14. The method as claimed in claim 1, wherein the classification information is comprised of signal loudness level of the audio data flows.
 15. The method as claimed in claim 1, wherein the first group is associated with active participants of a conference call, the second group is associated with inactive participants of the conference call, and the third group is associated with muted participants of the conference call.